Abstract

The problem of describing images through natural language has gained
importance in the computer vision community. Solutions to image
description have focused either on a top-down approach of generating
language through combinations of object detections and language models
or on bottom-up propagation of keyword tags from training images to test
images through probabilistic or nearest neighbor techniques. In
contrast, describing videos with natural language is a less studied
problem. In this paper, we combine ideas from the bottom-up and top-down
approaches to image description and propose a method for video
description that captures the most relevant contents of a video in a
natural language description. We propose a hybrid system consisting of a
low level multimodal latent topic model for initial keyword annotation,
a middle level of concept detectors and a high level module to produce
final lingual descriptions. We compare the results of our system to
human descriptions in both short and long forms on two datasets, and
demonstrate that the final system output has greater agreement with the
human descriptions than any single level.


1.  Introduction 

The problem of generating natural language descriptions of images and
videos has been steadily gaining prominence in the computer vision
community. A number of papers, for example, have proposed leveraging
latent topic models on low-level features [4, 6, 7, 22, 32]. The problem
is important for three reasons: i) transducing visual data into textual
data would permit well understood text-based indexing and retrieval
mechanisms essentially for free; ii) fine grained object models and
region labeling introduce a new level of semantic richness to multimedia
retrieval techniques; and iii) grounding representations of visual data
in natural language has great potential to overcome the inherent
semantic ambiguity prominent in the data-driven high-level vision
community (see [27] for a discussion of data-set bias and of the
different meanings common labels can have within and across data sets).

Figure 1: A framework of our hybrid system showing a video being
processed through our pipeline and described by a few natural language
sentences. Output from our system: 1) A person is on artificial rock
wall. 2) A person climbing a wall is on artificial rock wall. 3) Person
climbs rock wall indoors. 4) Young man tries to climb artificial rock
wall. 5) A man demonstrates how to climb a rock wall.

Fig. 1 shows our video to text system pipeline. To date, the most common
approach to such lingual description of images has been to model the
joint distribution over low-level image features and language, typically
nouns. Early work on multimodal topic models by Blei et al. [4] and
subsequent extensions [6, 7, 11, 22, 32] jointly model image features
(predominantly SIFT and HOG derivatives) and language words as mixed
memberships over latent topics with considerable success. Other
non-parametric nearest-neighbor and label transfer methods, such as
Makadia et al. [18] and TagProp [12], rely on large annotated sets to
generate descriptions from similar samples. These methods have
demonstrated a capability of lingual description on images at varying
levels, but they have two main limitations. First, being based on
low-level features and/or similarity measures, it is not clear they can
scale up as the richness of the semantic space increases. Second, the
generated text has largely been in the form of word-lists without any
semantic verification (see Sec. 2.3).

Alternatively, a second class of approaches to lingual description of
images directly seeks a set of high-level concepts, typically objects
but possibly others such as scene categories. Prominent among object
detectors is the deformable parts model (DPM) [10] and related visual
phrases [26] which have been successful in the task of “annotating”
natural images. Despite being able to guarantee the semantic veracity of
the generated lingual description, these methods have found limited use
due to the overall complexity of object detection in-the-wild and its
constituent limitations (i.e., noisy detection), and the challenge of
enumerating all relevant world concepts and learning a detector for
each.

In this work, we propose a hybrid model that takes the best
characteristics of these two classes of methods. Namely, our model
leverages the power of low-level joint distributions over video features
and language by treating them as a set of lingual proposals which are
subsequently filtered by a set of mid-level concept detectors. A test
video is processed in three ways (see Fig. 1). First, in a bottom up
fashion, low level video features predict keywords. We use multimodal
latent topic models to find a proposal distribution over some training
vocabulary of textual words [4, 7], then select the most probable
keywords as potential subjects, objects and verbs through a natural
language dependency grammar and part-of-speech tagging.

Second, in a top down fashion, we detect and stitch together a set of
concepts, such as “artificial rock wall” and “person climbing wall”
similar to [26], which are then converted to lingual descriptions
through a tripartite graph template. Third, for high level semantic
verification, we relate the predicted caption keywords with the detected
concepts to produce a ranked set of well formed natural language
sentences. Our semantic verification step is independent of any computer
vision framework and works by measuring the number of inversions between
two ranked lists of predicted keywords and detected concepts, both being
conditional on their respective learned topic multinomials.

Our method does not suffer from any lack of semantic verification as
bottom-up models do, nor does it suffer from the tractability challenges
of the top-down methods: it can rely on fewer well-trained concept
detectors for verification, allowing the correlation between different
concepts to replace the need for a vast set of concept detectors.

Videos vs. Images: Recent work in [9, 16, 34] is mainly focused on
generating fluent descriptions of a single image, not of videos. Videos
introduce an additional set of challenges such as temporal
variation/articulation and dependencies. Most related work in vision has
focused only on the activity classification side: example methods using
topic models for activities are the hidden topic Markov model [33] and
frame-by-frame Markov topic models [13], but these methods do not model
language and visual topics jointly. A recent activity classification
paper of relevance is the Action Bank method [25], which ties high-level
actions to constituent low-level action detections, but it does not
include any language generation framework.

The three most relevant works to ours are the Khan et al. [14], Barbu et
al. [1] and Malkarnenkar et al. [19] systems. All of these methods
extract high-level concepts, such as faces, humans, tables, etc., and
generate language description by template filling; [19] additionally
uses externally mined language data to help rank the best
subject-verb-object triplet. The methods rely directly on all high-level
concepts being enumerated (the second class of methods introduced above)
and hence may be led astray by noisy detection and have a limited
vocabulary, unlike our approach which not only uses the high-level
concepts but augments them with a large corpus of lingual descriptions
from the bottom-up. Furthermore, some have used datasets that have
simpler videos, not in-the-wild.

We, in contrast, focus on descriptions of general videos (e.g., from
YouTube) directly through bottom-up visual feature translations to text
and top-down concept detections. We leverage both detailed object
annotations and human lingual descriptions. Our proposed hybrid method
shows more relevant content generation over simple keyword annotation of
videos alone as observed using quantitative evaluation on two datasets:
the TRECVID dataset [20] and a new in-house dataset consisting of
cooking videos collected from YouTube with human lingual descriptions
generated through MTurk (Sec. 3).

2.  System  Description 
2.1.  Low  Level:  Topic  Model 

Following [7], we adapt the GM-LDA model in [4] (dubbed MMLDA for
MultiModalLDA in this paper) to handle a discrete visual feature space,
e.g., we use HOG3D [15]. The original model in [4] is defined in the
continuous space, which presents challenges for discrete features: it
can become unstable during deterministic approximate optimization due to
extreme values in high-dimensions and its inherent non-convexity [30].
We briefly explain the model and demonstrate how it is instantiated and
differs from the original version in [4]. First, we use an asymmetric
Dirichlet prior, $\alpha$, for the document level topic proportions
$\theta_d$ following [31], unlike the symmetric one in [4]. In Fig. 2,
$D$ is the number of documents, each consisting of a video and a lingual
description (the text is only available during training). The number of
discrete visual words and lingual words per video document $d$ are $N$
and $M$. The parameters for corpus level topic multinomials over visual
words are $\rho_{1:K}$. The parameters for corpus level topic
multinomials over textual words are $\beta_{1:K}$; only the training
instances of these parameters are used for keyword prediction. The
indicator variables for choosing a topic are $\{z_{d,n}\}$ and
$\{y_{d,m}\}$; $w_{d,m}$ is the text word at position $m$ in video
“document” $d$ with vocabulary size $V$. Each $w_{d,n}$ is a visual
feature from a bag-of-discrete-visual-words at position $n$ with
vocabulary size corrV and each $w_{d,n}$ represents a visual word (e.g.,
HOG3D [15] index, transformed color histogram [28], etc.).

Figure 2: Low level topic model.

We use the mean field method of optimizing a lower-bound to the true
likelihood of the data. A fully factorized $q$ distribution with “free”
variational parameters $\gamma$, $\phi$ and $\lambda$ is imposed by:

$$q(\theta, z, y \mid \gamma, \phi, \lambda) = \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \left[ \prod_{n=1}^{N_d} q(z_{d,n} \mid \phi_{d,n}) \prod_{m=1}^{M_d} q(y_{d,m} \mid \lambda_{d,m}) \right]. \qquad (1)$$

The optimal values of free variables and parameters are found by
optimizing the lower bound on $\log p(w_M, w_N \mid \alpha, \beta, \rho)$.
The free multinomial parameters of the variational topic distributions
ascribed to the corresponding data are $\phi_d$s. The free parameters of
the variational word-topic distribution are $\lambda_d$s. The surrogate
for the $K$-dimensional $\alpha$ is $\gamma_d$, which represents the
expected number of observations per document in each topic. The free
parameters are defined for every video document $d$. The optimal value
expressions of the hidden variables in video document $d$ for the MMLDA
model are as follows:

$$\phi_{n,i} \propto \exp\{\psi(\gamma_i) + \log \rho_{i, w_{d,n}}\}, \qquad (2)$$

$$\lambda_{m,i} \propto \exp\{\psi(\gamma_i) + \log \beta_{i, w_{d,m}}\}, \qquad (3)$$

$$\gamma_i = \alpha_i + \sum_{n=1}^{N_d} \phi_{n,i} + \sum_{m=1}^{M_d} \lambda_{m,i}; \qquad (4)$$

where $\psi$ is the digamma function. The expressions for the maximum
likelihood of the topic parameters are:

$$\rho_{i,j} \propto \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{j=1}^{\text{corrV}} \phi_{d,n,i}\, \delta(w_{d,n}, j), \qquad (5)$$

$$\beta_{i,j} \propto \sum_{d=1}^{D} \sum_{m=1}^{M_d} \sum_{j=1}^{V} \lambda_{d,m,i}\, \delta(w_{d,m}, j). \qquad (6)$$

The asymmetric $\alpha$ is optimized using the formulations given in
[5], which incorporates Newton steps as search directions in gradient
ascent.

A strongly constrained model, Corr-LDA, is also introduced in [4] that
uses real valued visual features and shows promising image annotation
performance. We have experimented with the model to use our discrete
visual feature space (and name it Corr-MMLDA) but opt not to use it in
our final experiments for the following reasons.

The correspondence between $w_{d,m}$ and $z_{d,n}$ necessitates checking
for correspondence strengths over all possible dependencies between
$w_{d,m}$ and $w_{d,n}$. This assumption is relaxed in the MMLDA model
and removes the bottleneck in runtime efficiency for high dimensional
video features without showing significant performance drain. Fig. 3
shows the held out log likelihoods or the Evidence Lower BOunds (ELBOs)
on part of the TRECVID dataset (Sec. 3.1). The figures are obtained by
topic modeling on the entire corpus of multimedia documents (video with
corresponding lingual description). Using visual features, we predict
the top $n$ words as the description of the test videos. Table 1 shows
the average 1-gram recall of predicted words (as in [14]). We observe
that both models have approximately the same fit and word prediction
power, and hence choose the MMLDA model since it is computationally less
expensive.

K=200         n=1       n=5       n=10      n=15
MMLDA         0.03518   0.11204   0.18700   0.24117
Corr-MMLDA    0.03641   0.11063   0.18406   0.24840

Table 1: Average word prediction 1-gram recall for different topic
models with 200 topics when the full corpus is used. The numbers are
slightly lower for lower numbers of topics but the differences are not
statistically significant.

Figure 3: Prediction ELBOs from the two topic models (MMLDA and
Corr-MMLDA) for the videos in TRECVID dataset, for K = 30, 45, 100 and
200 topics. Lower is better.


2.2. Middle Level: Concepts to Language


The middle level is a top-down approach that detects concepts sparsely
throughout the video, matches them over time, which we call stitching,
and relates them to a tripartite template graph for generating language
output.

2.2.1  Concept  Detectors 

Instead of using publicly available object detectors from datasets like
the PASCAL VOC [8], or training independent object detectors for objects
such as microphone, we build concept detectors such as microphone with
upper body, group of people etc., where multiple objects together form a
single concept. A concept detector captures richer semantic information
(from object, action and scene level) than object detectors, and usually
reduces the visual complexity compared to individual objects, which
requires fewer training examples for an accurate detector. These concept
detectors are closely related to Sadeghi and Farhadi’s visual phrases
[26] but do not use any decoding process and are applied on video.

Figure 4: Examples of DPM based concept detectors: person with
microphone; person climbing wall.

We use the deformable parts model (DPM) [10] for the concept detectors,
some examples of which are visualized in Fig. 4. The specific concepts
we choose are based on the most frequently occurring object-groupings in
the human descriptions from the training videos. We use the VATIC tool
[29] to annotate the trajectories of concept detectors in training
videos, which are also used in Sec. 2.2.3 for extracting concept
relations.

2.2.2  Sparse  Object  Stitching  (SOS) 

Concept detectors act as a proxy to the trajectories being tracked in a
video. However, tracking over detection is a challenging and open
problem for videos in-the-wild. First, camera motion and the frame rate
are unpredictable, rendering the existing tracking methods useless.
Second, the scale of our dataset is huge (thousands of video hours), and
we hence need a fast alternative. Our approach is called sparse object
stitching; we sparsely obtain the concept detections in a video and then
sequentially group frames based on commonly detected concepts.

For a given video, we run the set of concept detectors $C$ on $T$
sparsely distributed frames (e.g. 1 frame/sec) and denote the set of
positive detections on each frame as $\mathcal{D}_i$. The algorithm
tries to segment the video into a set of concept shots
$S = \{S_1, S_2, \ldots, S_Z\}$, where $S = \bigcup \mathcal{D}_i$, and
$Z \ll T$, so that each $S_j$ can be independently described by some
sparse detections similar in spirit to [14]. We start by uniformly
splitting the video into $K$ proposal shots
$\{S'_1, S'_2, \ldots, S'_K\}$. Then we greedily traverse the proposed
shots one by one considering neighboring shots $S'_k$ and $S'_{k+1}$. If
the Jaccard distance
$J(S'_k, S'_{k+1}) = 1 - \frac{|S'_k \cap S'_{k+1}|}{|S'_k \cup S'_{k+1}|}$
is lower than a threshold $\sigma$ (set as 0.5 using cross-validation),
then we merge these two proposed shots into one shot and compare it with
the next shot, otherwise shot $S'_k$ is an independent shot. For each
such concept shot, we match it to a tripartite template graph and
translate it to language, as we describe next.
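The greedy merging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `jaccard_distance` and `stitch_shots`, the representation of each frame's detections as a set of concept labels, and the treatment of two empty shots as identical are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of concept labels."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty shots are treated as identical
    return 1.0 - len(a & b) / len(union)


def stitch_shots(frame_detections, num_proposals, threshold=0.5):
    """Greedily merge neighbouring proposal shots that share concepts.

    frame_detections : list of sets, positive detections per sampled frame
    num_proposals    : number of uniform proposal shots to start from
    threshold        : merge when the Jaccard distance is below this value
    """
    T = len(frame_detections)
    if T == 0:
        return []
    num_proposals = max(1, min(num_proposals, T))
    # Uniformly split the T sampled frames into proposal shots.
    bounds = [round(k * T / num_proposals) for k in range(num_proposals + 1)]
    proposals = [
        set().union(*frame_detections[bounds[k]:bounds[k + 1]])
        for k in range(num_proposals)
    ]
    shots = [proposals[0]]
    for nxt in proposals[1:]:
        if jaccard_distance(shots[-1], nxt) < threshold:
            shots[-1] = shots[-1] | nxt  # merge, then compare with the next shot
        else:
            shots.append(nxt)  # the current shot stays independent
    return shots


frames = [
    {"person", "wall"}, {"person", "wall"},
    {"person", "wall", "rope"}, {"person", "wall"},
    {"bowl", "spoon"}, {"bowl"},
]
shots = stitch_shots(frames, num_proposals=3)
```

In this toy input the first two proposal shots share most of their concepts and are merged, while the last one shares none and starts a new concept shot.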


Figure  5:  Lingual  descriptions  from  tripartite  template 
graphs  consisting  of  concepts  as  vertices. 


2.2.3  Tripartite  Template  Graph 

We use a tripartite graph $\mathcal{G} = (V^s, V^t, V^o, E)$, with $V^s$
for human subjects, $V^t$ for tools, and $V^o$ for objects, that takes
the concept detections from each $S_j$ and generates template-based
language description. The vertex set $V = V^s \cup V^t \cup V^o$ is
identical to the set of concept detectors $C$ in the domain at hand.
Each concept detector is assigned to one of the three vertex sets (see
Fig. 5). The set of paths
$\mathcal{P} = \{(E_{\tau,\mu}, E_{\mu,\nu}) \mid \tau \in V^s, \mu \in V^t, \nu \in V^o\}$
is defined as all valid paths from $V^s$ to $V^o$ through $V^t$, and
each forms a possible language output. However, we prune $\mathcal{P}$
so that it contains only those valid paths that were observed in the
annotated training sequences. For a given domain, each such path, or
triplet $(V^s, V^t, V^o)$, instantiates a manually created template,
such as “($V^s$) is cleaning ($V^o$) with ($V^t$).”

Language Output: Given the top confident concept detections
$C_c \subset C$ in one concept shot $S_j$, we activate the set of paths
$\mathcal{P}_c \subset \mathcal{P}$. A natural language sentence is
output for paths containing a common subject using the template
$(V^s, V^t, V^o)$. For situations where $C_c \cap V^t = \emptyset$, the
consistency of the tripartite graph is maintained through a default
“BYPASS” node in $V^t$ (see Figs. 1 and 5). This node acts as a
“backspace” production rule in the final lingual output thereby
connecting the subject to an object effectively through a single edge.
There is, similarly, a BYPASS node in $V^o$ as well. In this paper, we
generally do not consider the situation that $C_c \cap V^s = \emptyset$,
in which no human subject is present. Histogram counts are used for
ranking the concept nodes for the lingual output.

Fig. 5 depicts a visual example of this process. The edges represent the
action phrases or function words that stitch the concepts together
cohesively. For example, consider the following structure: “([a person
with microphone]) is speaking to ([a large group of sitting people] and
[a small group of standing people]) with ([a camera man] and [board in
the back]).” Here the parentheses encapsulate a simple conjunctive
production rule and the phrases inside the square brackets denote human
subjects, tools or objects. The edge labels in this case are “is
speaking to” and “with” which are part of the template
$(V^s, V^t, V^o)$. In the figure, $C_c$ is colored blue and edges in
$\mathcal{P}_c$ with the common vertex “microphone-with-upper-body” are
colored red. We delete repeated sentences in the final description.
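As a rough illustration of the template step, the sketch below activates the pruned paths for one concept shot and fills a fixed template. It is a simplification under stated assumptions: the function `describe_shot`, the example vertex sets and paths, and the single hard-coded verb phrase are hypothetical, and the conjunctive grouping of several tools or objects under a common subject is not reproduced (one sentence is emitted per activated path).

```python
BYPASS = "BYPASS"  # placeholder vertex used when no tool or object is detected


def describe_shot(detections, vertex_sets, paths, verb="is cleaning", prep="with"):
    """Turn the concept detections of one shot into template sentences.

    detections  : list of detected concept names, most confident first
    vertex_sets : dict with keys "subject", "tool", "object" mapping to sets
    paths       : set of (subject, tool, object) triplets seen in training
    """
    subjects = [c for c in detections if c in vertex_sets["subject"]]
    tools = [c for c in detections if c in vertex_sets["tool"]] or [BYPASS]
    objects = [c for c in detections if c in vertex_sets["object"]] or [BYPASS]

    sentences = []
    for s in subjects:
        for t in tools:
            for o in objects:
                if (s, t, o) not in paths:
                    continue  # pruned: never observed in training annotations
                words = [s, verb]
                if o != BYPASS:
                    words.append(o)
                if t != BYPASS:  # BYPASS removes the tool phrase entirely
                    words += [prep, t]
                sentence = " ".join(words) + "."
                if sentence not in sentences:  # delete repeated sentences
                    sentences.append(sentence)
    return sentences


# Hypothetical vertex sets and pruned paths for a cleaning domain.
VERTEX_SETS = {
    "subject": {"person"},
    "tool": {"sponge", "brush"},
    "object": {"microwave"},
}
PATHS = {
    ("person", "sponge", "microwave"),
    ("person", BYPASS, "microwave"),
}

no_tool = describe_shot(["person", "microwave"], VERTEX_SETS, PATHS)
with_tool = describe_shot(["person", "sponge", "microwave"], VERTEX_SETS, PATHS)
```

When no tool is detected the BYPASS vertex connects subject and object directly, so the tool phrase is dropped from the sentence.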

2.3.  High  Level:  Semantic  Verification 

The high level system joins the two earlier sets of lingual descriptions
(from the low and middle levels) to enhance the set of sentences given
from the middle level and at the same time to filter the sentences from
the low level. Our method takes the predicted words from the low level
and tags their part-of-speech (POS) with standard NLP tools. These are
used to retrieve weighted nearest neighbors from the training
descriptions, which are then ranked according to predictive importance,
similar in spirit to how Farhadi et al. [9] select sentences. In
contrast, we rank over semantically verified low level sentences, giving
higher weight to shorter sentences and a fixed preference to middle
level sentences.

We use the dependency grammar and part-of-speech (POS) models in the
Stanford NLP Suite* to create annotated dictionaries based on word
morphologies; the human descriptions provide the input. The predicted
keywords from the low level topic models are labeled through these
dictionaries. For more than two POS for the same morphology, we prefer
verbs, but other variants can be retained as well without loss of
generality. For the video in Fig. 5, we obtain the following labeled top
15 keywords: “hall/OBJ town/NOUN meeting/VERB man/SUBJ-HUMAN
speaks/VERB microphone/OBJ talking/VERB representative/SUBJ-HUMAN
health/NOUN care/NOUN politician/SUBJ-HUMAN chairs/NOUN flags/OBJ
people/OBJ crowd/OBJ.” The word annotation classes used are Subjects,
Verbs, Objects, Nouns and “Other.” Subjects which can be humans
(SUBJ-HUMAN) are determined using WordNet synsets.

To obtain the final lingual description of a test video, the output from
the middle level is used first. If there happen to be no detections, we
rely only on the low-level generated sentences. For semantic
verification, we train MMLDA on a vocabulary of training descriptions
and training concept annotations available using VATIC. Then we compute
the number of topic rank inversions for two ranked lists of the top $P$
predictions and top $C$ detections from a test video as:

$$L_{\text{keywords}} = \Big\langle \downarrow k \,\Big|\, \sum_{j=1}^{V} \sum_{m=1}^{P} p(w_m \mid \beta_k)\, \delta(w_m, j) \Big\rangle, \qquad L_{\text{concepts}} = \Big\langle \downarrow k \,\Big|\, \sum_{j=1}^{\text{corrV}} \sum_{n=1}^{C} p(w_n \mid \rho_k)\, \delta(w_n, j) \Big\rangle, \qquad (7)$$

where each list ranks the topics $k$ in descending order of the
corresponding sum. If the number of inversions is less than a threshold
($< \sqrt{P + C}$) then the keywords are semantically verified by the
detected concept list.

*nlp.stanford.edu/software/corenlp.shtml
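Counting rank inversions between two ranked topic lists can be sketched as follows. This is an illustrative reading of the verification test, not the authors' code: the helper names `count_inversions` and `is_verified` are hypothetical, and the sketch assumes an inversion is a pair of topics that appears in opposite order in the two lists, with topics missing from either list ignored.

```python
from math import sqrt


def count_inversions(list_a, list_b):
    """Count pairs of items whose relative order differs between two rankings."""
    position_in_b = {item: rank for rank, item in enumerate(list_b)}
    # Ranks in list_b of the shared items, read in list_a's order.
    ranks = [position_in_b[x] for x in list_a if x in position_in_b]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def is_verified(topics_by_keywords, topics_by_concepts, num_predictions, num_detections):
    """Accept the keywords when the inversion count is below sqrt(P + C)."""
    inversions = count_inversions(topics_by_keywords, topics_by_concepts)
    return inversions < sqrt(num_predictions + num_detections)


agree = is_verified([0, 1, 2, 3], [0, 1, 3, 2], num_predictions=15, num_detections=5)
disagree = is_verified([3, 2, 1, 0], [0, 1, 2, 3], num_predictions=15, num_detections=5)
```

The quadratic pair count keeps the sketch short; a merge-sort based count would be preferable for long lists.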

Finally, we retrieve nearest neighbor sentences from the training
descriptions by a ranking function. Each sentence $s$ is ranked as:
$r_s = b\,h\,(w_1 x_{s_1} + w_2 x_{s_2})$ where $b$ is a boolean
variable indicating that a sentence must have at least two of the
labeled predictions, which are verified by the class of words to which
the concept models belong. The boolean variable $h$ indicates the
presence of at least one human subject in the sentence. The variable
indicating the total number of matches divided by the number of words in
the sentence is $x_{s_1}$; this penalizes longer and irrelevant
sentences. The sum of the weights of the predicted words from the topic
model in the sentence is $x_{s_2}$; the latent topical strength is
reflected here. Each of $x_{s_1}$ and $x_{s_2}$ is normalized over all
matching sentences. The weights for sentence length penalty and topic
strength respectively are $w_1$ and $w_2$ (set to be equal in our
implementation).
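A small Python sketch of this ranking function follows. Several details are assumptions rather than facts from the text: the function name `rank_sentences`, whitespace tokenization, treating the supplied keywords as already verified, and normalizing each feature by its sum over the matching sentences (the text does not say which normalization is used).

```python
def rank_sentences(sentences, keyword_weights, human_subjects, w1=0.5, w2=0.5):
    """Rank training sentences by r_s = b * h * (w1 * x_s1 + w2 * x_s2).

    sentences       : candidate training sentences (strings)
    keyword_weights : dict mapping predicted keyword -> topic model weight
    human_subjects  : set of words labeled as human subjects
    """
    features = []
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        matched = [w for w in words if w in keyword_weights]
        b = len(set(matched)) >= 2  # at least two labeled predictions
        h = any(w in human_subjects for w in words)  # a human subject is present
        if b and h:
            x1 = len(matched) / len(words)  # match density penalizes long sentences
            x2 = sum(keyword_weights[w] for w in set(matched))  # topical strength
            features.append((sentence, x1, x2))

    # Assumed normalization: divide by the sum over all matching sentences.
    z1 = sum(f[1] for f in features) or 1.0
    z2 = sum(f[2] for f in features) or 1.0
    scored = [(w1 * x1 / z1 + w2 * x2 / z2, s) for s, x1, x2 in features]
    return [s for _, s in sorted(scored, key=lambda pair: -pair[0])]


weights = {"man": 0.3, "speaks": 0.2, "microphone": 0.25, "crowd": 0.1}
candidates = [
    "A man speaks into a microphone.",
    "The crowd is large.",
    "A man speaks to a large crowd in a very big town hall.",
]
ranked = rank_sentences(candidates, weights, human_subjects={"man", "woman"})
```

Sentences with $b = 0$ or $h = 0$ receive a zero score in the formula, so the sketch simply drops them from the ranking.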

3.  Experimental  Setup  and  Results 
3.1.  Datasets  and  Features 

TRECVID MED12 dataset: The first dataset we use for generating lingual
descriptions of real life videos is part of TRECVID Multimedia Event
Detection (MED12) [20]. The training set has 25 event categories each
containing about 200 videos of positive and related instances of the
event descriptions. For choosing one topic model over another (Sec. 2.1)
we use the positive videos and descriptions in the 25 training events
and predict the words for the positive videos for the first five events
in the Dev-T collection. The descriptions in the training set consist of
short and very high level descriptions of the corresponding videos
ranging from 2 to 42 words and averaging 10 words with stop words. We
use 68 concept models on this dataset.

A separate dataset released as part of the Multimedia Event Recounting
(MER) task contains six test videos per event where the five events are
selected from the 25 events for MED12. These five events are: 1)
Cleaning an appliance; 2) Renovating a home; 3) Rock Climbing; 4) Town
hall meeting; 5) Working on a metal crafts project. Since this MER12
test set cannot be publicly released for obtaining descriptions, we
employ in-house annotators (blinded to our methodology) to write one
description for each video.

In-house “YouCook” dataset on cooking videos: We have also collected a
new dataset for this video description task, which we call YouCook. The
dataset consists of 88 videos downloaded from YouTube, roughly uniformly
split into six different cooking styles, such as baking and grilling.
The videos all have a third-person viewpoint, take place in different
kitchen environments, and frequently display dynamic camera changes. The
training set consists of 49 videos with object annotations. The test set
consists of 39 videos. The objects for YouCook are in the categories of
utensils (31%), bowls (38%), foods and other; with 10 different object
classes for utensils and bowls (we discard the other classes in this
paper because of too few instances).

We use MTurk to obtain multiple human descriptions for each video. The
annotators are shown an example video with a sample description focusing
on the actions and objects therein. Participants in MTurk are instructed
to watch a cooking video as many times as required to lingually describe
the video in at least three sentences totaling a minimum of 15 words. We
set our minimum due to the complex nature of the micro-actions in this
dataset. The average number of words per summary is 67, the average
number of words per sentence is 10 with stop words and the average
number of descriptions per video is eight. The recent data set [24] is
also about cooking but it has a fixed scene and no object annotations.

Our new YouCook dataset, its annotations and descriptions, and the
train/test splits are available at
http://www.cse.buffalo.edu/~jcorso/r/youcook.

Low Level Features for Topic Model: We use three different types of low
level video features: (1) HOG3D [15], (2) color histograms, and (3)
transformed color histograms (TCH) [28]. HOG3D [15] describes local
spatiotemporal gradients. We resize the video frames such that the
largest dimension (height or width) is 160 pixels, and extract HOG3D
features from a dense sampling of frames. We then use K-means clustering
to create a 4000-word codebook for the MED12 data, and a 1000-word
codebook for the YouCook data, due to sparsity of the dataset following
[3]. Color histograms are computed using 512 RGB color bins. Further,
they are computed over each frame and merged across the video. Due to
large deviations in the extreme values, we use the histogram between the
15th and 85th percentiles averaged over the entire video. To account for
poor resolution in some videos, we also use the TCH features [28] with a
4096 dimension codebook.
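To make the 512-bin color histogram concrete, the sketch below quantizes each RGB channel into 8 levels (8 x 8 x 8 = 512 bins) and merges per-frame counts across a video. The equal-width quantization, the function names and the sum-then-normalize merge are assumptions for illustration; the percentile trimming described above is not reproduced because the text does not specify it in enough detail.

```python
def rgb_bin(r, g, b, bins_per_channel=8):
    """Map an 8-bit RGB triple to a single bin index (512 bins by default)."""
    step = 256 // bins_per_channel
    return ((r // step) * bins_per_channel ** 2
            + (g // step) * bins_per_channel
            + (b // step))


def frame_histogram(pixels, bins_per_channel=8):
    """Count the pixels of one frame falling in each color bin."""
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        hist[rgb_bin(r, g, b, bins_per_channel)] += 1
    return hist


def video_histogram(frames, bins_per_channel=8):
    """Merge per-frame histograms and normalize to sum to one."""
    total = [0] * bins_per_channel ** 3
    for pixels in frames:
        for i, count in enumerate(frame_histogram(pixels, bins_per_channel)):
            total[i] += count
    n = sum(total) or 1  # avoid division by zero for an empty video
    return [count / n for count in total]


# Two tiny "frames", each given as a list of (R, G, B) pixels.
toy_video = [
    [(0, 0, 0), (255, 255, 255)],
    [(0, 0, 0), (32, 0, 0)],
]
hist = video_histogram(toy_video)
```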

For a given description task, the event type is assumed known (specified
manually or by some prior event detection output); we hence learn
separate topic models for each event that vary based on the language
vocabulary. However, the visual feature codebooks are not event
specific. When learning each specific topic model, we use 5-fold cross
validation to select the subset of best performing visual features. For
example, on YouCook, we ultimately use HOG3D and color histograms,
whereas on most of MED12 we use HOG3D and TCH (selected through
cross-validation).

3.2.  Quantitative  Evaluation 

We use the ROUGE [17] tool to evaluate the level of relevant content
generated in our system output video descriptions. As used in [34],
ROUGE is a standard for comparing text summarization systems that
focuses on recall of relevant information coverage. ROUGE allows a
perfect score of 1.0 in case of a perfect match given only one reference
description. The BLEU [21] scorer is more precision oriented and is
useful for comparing accuracy and fluency (usually using 4-grams) of the
outputs of text translation systems as used in [2, 16] which is not our
end task.
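For intuition, a simplified unigram overlap score in the spirit of ROUGE-1 can be sketched as below. This is not the ROUGE tool itself: the function `rouge1` is hypothetical, tokenization is plain lowercase whitespace splitting, and no stemming, stop word handling or multi-reference aggregation is applied.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (recall, precision) of clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / max(1, sum(ref_counts.values()))
    precision = overlap / max(1, sum(cand_counts.values()))
    return recall, precision


recall, precision = rouge1(
    "a person climbs a rock wall",
    "a man climbs a rock wall indoors",
)
```

Recall divides the overlap by the reference length and precision by the candidate length, which is why longer system outputs tend to raise recall while lowering precision.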

Quantitative evaluation itself is a challenge: in the UIUC PASCAL
sentence dataset [23], five sentences are used per image. We, on the
other hand, allow at most five sentences per video per level (low or
middle), up to a maximum of ten. A human, however, can typically
describe a video in just one sentence.

Table 2 shows the ROUGE-1 recall and precision scores
obtained from the different outputs of our system for the
MER12 test set. In Tables 2 and 3, “Low” is the sentence
output from our low level topic models and NLP tools,
“Middle” is the output from the middle level concepts, and
“High” is the semantically verified final output. We use
the top 15 keywords with redundancy, particularly retaining
subjects like “man,” “woman,” etc. and verb morphologies
(which otherwise stem to the same prefix), as proxies for
ten-word training descriptions. All system descriptions are
sentences, except the baseline [7], which is keywords.

From Table 2, it is clear that lingual descriptions from
both the low and middle levels of our system cover more
relevant information, albeit at the cost of introducing
additional words. Increasing the number of keywords improves
recall but precision drops dramatically. The drop in
precision for our final output is also due to the increased
length of the descriptions. However, the scores remain within
the 95% confidence interval of that from the keywords for the
“Renovating home,” “Town hall meeting” and “Metal crafts project”
events. The “Rock climbing” event has very short human
descriptions, and the “Cleaning an appliance” event is very hard
for both DPM and MMLDA, since multiple related concepts
indicative of appliances in context appear in prediction and
detection. From Table 2 we also see the efficacy of the short
lingual descriptions from the middle level in terms of precision,
while the final output of our system significantly outperforms
the lingual descriptions from the other individual levels in
relevant content coverage, i.e., recall.
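
The significance statements above rest on 95% confidence intervals over per-video scores. The text does not say how these intervals are computed; one common choice, assumed here purely for illustration, is a percentile bootstrap over the per-video scores of one system. The function name, the number of resamples, and the fixed seed are all hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Resample the videos with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Under this reading, a system's mean score would be called "within the confidence interval" of another when it falls between the `lo` and `hi` values computed from the other system's per-video scores.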

Table 3 shows ROUGE scores for both 1-gram and 2-gram
comparisons. R1 means ROUGE-1-Recall and P1 means
ROUGE-1-Precision; similarly for R2 and P2. The length of all
system summaries is truncated at 67 words based on the average
human description length. The sentences from the low level are
chosen based on the top 15 predictions only. For fair comparison
on recall, the number of keywords ([7] columns in Table 3) is
chosen to be 67. The numbers in bold are significant at 95%
confidence over corresponding columns on the left. R2 is non-zero
for keywords since some paired keywords are indeed phrases. Our
method thus performs significantly well even when compared
against longer descriptions. Our lingual descriptions built on
top of concept labels and just a few keywords significantly
outperform labeling with even four times as large a set of
keywords. This can also tune language models to context since
creating a sentence out of the predicted nouns and verbs does not
increase recall based on unigrams.

                      Precision                               Recall
Events                [7]       Low       Middle    High      [7]       Low       Middle    High
Cleaning appliance    20.03     17.52     11.69(*)  10.68(*)  19.16(-)  32.60     35.76     48.15
Renovating home       6.66      15.29     12.55     9.99      7.31(-)   43.41     30.67     49.52
Rock climbing         24.45     16.21(*)  24.52     12.61(*)  44.09     59.22     46.23     65.84
Town hall meeting     17.35     14.41     27.56     13.36     13.80(-)  28.66     45.55     56.44
Metal crafts project  16.73     18.12     31.68     15.63     19.01(-)  41.87     25.87     54.84

Table 2: ROUGE-1 PR scores for the MER12 test set. A (-) for the Recall-[7] column means significantly lower performance
than the next 3 columns. The bold numbers in the last column are significantly better than the previous 3 columns in terms of
recall. The bold numbers in the Precision-Middle column are significantly better than those in the Precision-[7] column. A (*) in
columns 3, 4 or 5 means significantly lower than Precision-[7]. A 95% confidence interval is used for significance testing.

[7]                                     High
P2        P1        R2        R1        P2        P1        R2        R1
6E-4      15.47     6E-4      19.02     5.04      24.82     6.81      34.2

Table 3: ROUGE scores for our “YouCook” dataset.

3.3.  Qualitative  Examples 

The  first  four  rows  in  Fig.  6  show  examples  from  the 
MER12  test  set.  The  first  one  or  two  italicized  sentences 
in  each  row  are  the  result  of  the  middle  level  output.  The 
“health  care  reform”  in  the  second  row  is  a  noise  phrase 
that actually cannot be verified through our middle level but
remains  in  the  description  due  to  our  conservative  ranking 
formula.  Next  we  show  one  good  and  one  bad  example 
from  our  YouCook  dataset.  The  human  descriptions  in  the 
last  two  rows  are  shown  for  the  purpose  of  illustrating  their 
variance  and  yet  their  relevancy.  The  last  cooking  video  has 
a  low  R1  score  of  21%  due  to  imprecise  predictions  and 
detections. 

4.  Conclusion 

In this paper we combine the best aspects of top-down
and bottom-up methods of producing lingual descriptions
of videos in-the-wild that exploit the rich semantic space of
both text and visual features. Our contribution is unique
in that the class of concept detectors semantically verifies
low level predictions from the bottom up and leverages both
sentence generation and selection, which together outperform
output from the independent modules. Our future work will
emphasize scalability in the semantic space to increase the
generality of plausible lingual descriptions.
Cleaning an appliance


Keywords: refrigerator/OBJ cleans/VERB man/SUBJ-HUMAN clean/VERB blender/OBJ cleaning/VERB woman/SUBJ-HUMAN
person/SUBJ-HUMAN  stove/OBJ  microwave/OBJ  sponge/NOUN  food/OBJ  home/OBJ  hose/OBJ  oven/OBJ 

Sentences from Our System: 1. A person is using dish towel and hand held brush or vacuum to clean panel with knobs and
washing  basin  or  sink.  2.  Man  cleaning  a  refrigerator.  3.  Man  cleans  his  blender.  4.  Woman  cleans  old  food  out  of  refrigerator. 
5.  Man  cleans  top  of  microwave  with  sponge. 

Human  Synopsis:  Two  standing  persons  clean  a  stove  top  with  a  vacuum  clean  with  a  hose. 


Town  hall  meeting 


Keywords: meeting/VERB town/NOUN hall/OBJ microphone/OBJ talking/VERB people/OBJ podium/OBJ speech/OBJ
woman/SUBJ-HUMAN man/SUBJ-HUMAN chairs/NOUN clapping/VERB speaks/VERB questions/VERB giving/VERB

Sentences from Our System: 1. A person is speaking to a small group of sitting people and a small group of standing people
with  board  in  the  back.  2.  A  person  is  speaking  to  a  small  group  of  standing  people  with  board  in  the  back.  3.  Man  opens 
town  hall  meeting.  4.  Woman  speaks  at  town  meeting.  5.  Man  gives  speech  on  health  care  reform  at  a  town  hall  meeting. 
Human  Synopsis:  A  man  talks  to  a  mob  of  sitting  persons  who  clap  at  the  end  of  his  short  speech. 
Renovating home


Keywords: people/SUBJ-HUMAN, home/OBJ, group/OBJ, renovating/VERB, working/VERB, montage/OBJ, stop/VERB,
motion/OBJ, appears/VERB, building/VERB, floor/OBJ, tiles/OBJ, floorboards/OTHER, man/SUBJ-HUMAN, laying/VERB

Sentences  from  Our  System:  1.  A  person  is  using  power  drill  to  renovate  a  house.  2.  A  crouching  person  is  using  power 
drill  to  renovate  a  house.  3.  A  person  is  using  trowel  to  renovate  a  house.  4.  man  lays  out  underlay  for  installing  flooring.  5. 
A  man  lays  a  plywood  floor  in  time  lapsed  video. 

Human  Synopsis:  Time  lapse  video  of  people  making  a  concrete  porch  with  sanders,  brooms,  vacuums  and  other  tools. 


Metal  crafts  project 


Keywords: metal/OBJ man/SUBJ-HUMAN bending/VERB hammer/VERB piece/OBJ tools/OBJ rods/OBJ hammering/VERB
craft/VERB iron/OBJ workshop/OBJ holding/VERB works/VERB steel/OBJ bicycle/OBJ

Sentences from Our System: 1. A person is working with pliers. 2. Man hammering metal. 3. Man bending metal in
workshop.  4.  Man  works  various  pieces  of  metal.  5.  A  man  works  on  a  metal  craft  at  a  workshop. 

Human  Synopsis:  A  man  is  shaping  a  star  with  a  hammer. 


Cooking  video:  High  ROUGE  score 


Keywords:  bowl/OBJ  pan/OBJ  video/OBJ  adds/VERB  lady/OBJ  pieces/OBJ  ingredients/OBJ  oil/OBJ  glass/OBJ  liquid/OBJ 
butter/SUBJ-HUMAN  woman/SUBJ-HUMAN  add/VERB  stove/OBJ  salt/OBJ 

Sentences from Our System: 1. A person is cooking with bowl and stovetop. 2. In a pan add little butter. 3. She adds
some  oil  and  a  piece  of  butter  in  the  pan.  4.  A  woman  holds  up  Bisquick  flour  and  then  adds  several  ingredients  to  a  bowl.  5. 
A  woman  adds  ingredients  to  a  blender. 

Human Synopsis1: A lady wearing red colored dress, blending (think butter) in a big sized bowl. Besides there is 2 small bowls
containing  white  color  powders.  It  may  be  maida  flour  and  sugar.  After  she  is  mixing  the  both  powders  in  that  big  bowl  and  blending 
together.  Human  Synopsis2:  In  this  video,  a  woman  first  adds  the  ingredients  from  a  plate  to  a  large  porcelain  bowl.  She  then  adds 
various  other  ingredients  from  various  different  bowls.  She  then  mixes  all  the  ingredients  with  a  wooden  spoon. 


Cooking  video:  Low  ROUGE  score 


Keywords:  bowl/OBJ  pan/OBJ  video/OBJ  adds/VERB  ingredients/OBJ  lady/OBJ  woman/SUBJ-HUMAN  add/VERB  pieces/OBJ 
stove/OBJ  oil/OBJ  put/VERB  added/VERB  mixes/VERB  glass/OBJ 

Sentences from Our System: 1. A person is cooking with pan and bowl. 2. A person is cooking with pan. 2. A woman adds
ingredients  to  a  blender.  2.  In  this  video,  a  woman  adds  a  few  ingredients  in  a  glass  bowl  and  mixes  them  well.  3.  In  this 
video,  a  woman  first  adds  the  ingredients  from  a  plate  to  a  large  porcelain  bowl  4.  The  woman  is  mixing  some  ingredients  in 
a  bowl.  5.  the  woman  in  the  video  has  a  large  glass  bowl. 

Human Synopsis1: The woman is giving directions on how to cook bacon omelette. She shows the ingredients for cooking and was
frying  the  bacon,  scrambling  the  egg,  melting  the  butter  and  garnishing  it  with  onions  and  placed  some  cheese  on  top.  The  woman 
then  placed  the  scrambled  egg  and  bacon  to  cook  and  then  placed  it  on  a  dish.  Human  Synopsis2:  in  this  video  the  woman  takes 
bacon,  eggs,  cheese  ,  onion  in  different  containers.  On  a  pan  she  cooks  the  bacon  on  low  flame.  Side  by  side  she  beats  the  eggs  in  a 
bowl,  she  removes  the  cooked  bacon  on  a  plate.  In  the  pan  she  fries  onions  and  then  adds  the  beaten  eggs.  She  sprinkles  grated  cheese 
on  the  pan  and  cooks  well.  She  then  adds  the  fried  bacon  on  the  eggs  in  the  pan  and  cook  well.  She  transfers  the  cooked  egg  with  bacon 
to  as  serving  plate. 


Figure  6:  Qualitative  results  from  MER12  and  our  “YouCook”  dataset.  Only  the  top  5  sentences  from  our  system  are  shown. 